feat: USDT probes for tokio task events #7717
tokio/src/util/usdt.rs
Outdated
fn task__terminate(task_id: u64) {}

fn task__waker__clone(task_id: u64) {}
fn task__waker__wake(task_id: u64) {}
Nice idea to represent waker as wake + drop and remove the need for wake_by_ref.
I actually wasn't sure if I liked that 😅
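One way to read the wake + drop idea is sketched below: a by-value wake is reported as a wake event followed by a drop event, so wake_by_ref only needs the wake event and no dedicated probe. The counters are hypothetical stand-ins for the task__waker__* probes and the vtable functions are illustrative only; this is an assumption about the intent, not the PR's implementation.

```rust
use std::ptr;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::task::{RawWaker, RawWakerVTable, Waker};

// Hypothetical counters standing in for the clone, wake and drop probes.
pub static CLONES: AtomicUsize = AtomicUsize::new(0);
pub static WAKES: AtomicUsize = AtomicUsize::new(0);
pub static DROPS: AtomicUsize = AtomicUsize::new(0);

static VTABLE: RawWakerVTable =
    RawWakerVTable::new(vt_clone, vt_wake, vt_wake_by_ref, vt_drop);

unsafe fn vt_clone(data: *const ()) -> RawWaker {
    CLONES.fetch_add(1, Ordering::Relaxed);
    RawWaker::new(data, &VTABLE)
}

unsafe fn vt_wake(_data: *const ()) {
    // A by-value wake consumes the waker: report it as wake + drop.
    WAKES.fetch_add(1, Ordering::Relaxed);
    DROPS.fetch_add(1, Ordering::Relaxed);
}

unsafe fn vt_wake_by_ref(_data: *const ()) {
    // The waker stays alive, so only the wake event fires.
    WAKES.fetch_add(1, Ordering::Relaxed);
}

unsafe fn vt_drop(_data: *const ()) {
    DROPS.fetch_add(1, Ordering::Relaxed);
}

/// Builds a waker whose only effect is to bump the counters above.
pub fn counting_waker() -> Waker {
    // SAFETY: the vtable functions never dereference the data pointer,
    // so a null pointer is acceptable here.
    unsafe { Waker::from_raw(RawWaker::new(ptr::null(), &VTABLE)) }
}
```

With this shape, a tracer that sees a wake immediately followed by a drop for the same waker can infer a by-value wake, at the cost of not being able to tell it apart from a wake_by_ref followed by an unrelated drop.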
638f594 to
3eb3cde
Compare
3eb3cde to
e3a7a09
Compare
Musings regarding usdt performance: oxidecomputer/usdt#490
I think I'm going to vendor the minimal subset of the code needed to support USDT on the platforms supported by the usdt crate, removing any extra dependencies in the process. I'd like to work with the Oxide devs to improve this, but currently usdt brings in a lot of unnecessary dependencies, and it also doesn't allow cross-compilation. Lastly, I want to rework the generated code to reduce any performance impact. Because I will remove all dependencies, I'll move it back to a cfg tokio_usdt and remove the feature.
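As a rough sketch of what gating on a cfg rather than a Cargo feature could look like: the module and function names below are hypothetical, and the sketch assumes the flag would be passed the same way tokio_unstable is today, for example via RUSTFLAGS="--cfg tokio_usdt".

```rust
// Assumption: compiled in only when rustc is invoked with `--cfg tokio_usdt`.
// The `probes` module and its contents are hypothetical names.
#[cfg(tokio_usdt)]
pub mod probes {
    pub const ENABLED: bool = true;

    pub fn task_poll_start(task_id: u64) {
        // A real build would fire the USDT probe here.
        let _ = task_id;
    }
}

// Without the cfg, every probe is an empty function the optimiser can remove.
#[cfg(not(tokio_usdt))]
pub mod probes {
    pub const ENABLED: bool = false;

    #[inline(always)]
    pub fn task_poll_start(_task_id: u64) {}
}
```

Call sites stay identical in both configurations, so no cfg attributes are needed outside the probes module.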
Are we using checked in assembly and Also, are the files in |
I had a lot of issues with using usdt directly.
The asm was originally generated using the proc macro, but has since been rewritten. Individual probes shouldn't need to be regenerated, but adding new probes will need some manual effort. I've tried to abstract the asm into some shared macros, and I have some more work to do there. The macOS code would require the most effort if new probes are needed, but I will also document it. Some notable changes I've made to the asm are
What we lose by not using the usdt crate is the FreeBSD/illumos support. I don't have any FreeBSD/illumos setups available to test, and the code to support USDT on those platforms is the most complicated. I don't consider any of these changes impossible for the usdt crate to support, but it's going to take effort and considerable rewrites of large amounts of their code. Improving the asm is relatively easy, and I'd be comfortable sacrificing the build constraints when that's done. However, I do consider the concise asm to be a blocker, since we really intend for this feature to be as close to zero overhead as possible.
Thanks for the details! As you've said, I think it would be important to document these steps (maybe a README.md in the usdt directory). For illumos support, maybe we can convince someone at Oxide to add it, since it would perhaps benefit them the most. (-:
fn task_details_inner(task_id: u64, name: &str, file: &str, line: u32, col: u32) {
    // add nul bytes
    let name0 = [name.as_bytes(), b"\0"].concat();
    let file0 = [file.as_bytes(), b"\0"].concat();
- fn task_details_inner(task_id: u64, name: &str, file: &str, line: u32, col: u32) {
-     // add nul bytes
-     let name0 = [name.as_bytes(), b"\0"].concat();
-     let file0 = [file.as_bytes(), b"\0"].concat();
+ fn task_details_inner(task_id: u64, name: &CStr, file: &CStr, line: u32, col: u32) {
Getting the filename of a Location as a CStr will be stable in 1.92.
Some concerns:
- This is the slow path - any conversions of &str -> &CStr should take place within this function.
- I would rather not require nightly for the file location. We can wait 6 weeks.
- Because the task name might contain a \0 somewhere in the middle, it ends up being a fair bit of code and error handling to construct such a string.
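To illustrate the interior-NUL concern, here is a minimal sketch of one possible policy: truncate the name at the first NUL byte instead of returning an error. The helper name is hypothetical and the truncation policy is an assumption; the PR may prefer to reject or escape such names.

```rust
use std::ffi::CString;

/// Hypothetical helper: converts a task name to a C string, truncating at
/// the first interior NUL byte rather than failing.
pub fn to_c_string_lossy(s: &str) -> CString {
    match CString::new(s) {
        Ok(c) => c,
        Err(err) => {
            // `nul_position` is the index of the first NUL byte in the input.
            let pos = err.nul_position();
            let mut bytes = err.into_vec();
            bytes.truncate(pos);
            // No NUL byte can appear before the first one, so this cannot fail.
            CString::new(bytes).expect("no interior NUL after truncation")
        }
    }
}
```

This allocates, which fits the point above that the conversion belongs on the slow path, inside the function that only runs when a tracer is attached.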
Can you share more details about this? Do I understand correctly that this works by replacing nop instructions with a different instruction (call or jmp or similar)?
That is correct. I'll try to verify the exact instructions that get replaced (IIRC, Linux uses an interrupt and macOS uses a function call), but on both macOS and Linux I've observed with lldb that the generated code for wake_by_ref contains a nop instruction and some simple register ops. I'm unable to test it right now, but the eBPF docs claim that Linux uses an interrupt: https://docs.ebpf.io/linux/concepts/usdt/#attaching-with-ebpf. When testing on aarch64-apple-darwin, I see a NOP being replaced with a FASTTRAP instruction.
Here are some assembly differences, ignoring any label changes:
wake_by_ref
.section .text.tokio::runtime::task::waker::wake_by_ref,"ax",@progbits
.globl tokio::runtime::task::waker::wake_by_ref
.p2align 4
.type tokio::runtime::task::waker::wake_by_ref,@function
tokio::runtime::task::waker::wake_by_ref:
.cfi_startproc
sub rsp, 24
.cfi_def_cfa_offset 32
+ mov rax, qword ptr [rdi + 16]
+ mov rax, qword ptr [rax + 72]
+ mov rax, qword ptr [rdi + rax]
+ nop
mov rax, qword ptr [rdi]
lea rcx, [rsp + 8]
lea rdx, [rsp + 16]
.p2align 4
.LBB193_1:
test al, 2
jne .LBB193_2
test al, 4
jne .LBB193_4
test al, 1
jne .LBB193_8
test rax, rax
js .LBB193_14
lea r8, [rax + 68]
mov sil, 1
jmp .LBB193_9
.p2align 4
.LBB193_2:
xor esi, esi
mov r9, rcx
xor r8d, r8d
mov qword ptr [r9], r8
cmp dword ptr [rsp + 8], 1
je .LBB193_11
jmp .LBB193_12
.p2align 4
.LBB193_4:
xor esi, esi
mov r8, rax
jmp .LBB193_9
.LBB193_8:
mov r8, rax
or r8, 4
xor esi, esi
.LBB193_9:
mov qword ptr [rsp + 8], 1
mov r9, rdx
mov qword ptr [r9], r8
cmp dword ptr [rsp + 8], 1
jne .LBB193_12
.LBB193_11:
mov r8, qword ptr [rsp + 16]
lock cmpxchg qword ptr [rdi], r8
jne .LBB193_1
.LBB193_12:
test sil, sil
je .LBB193_13
mov rax, qword ptr [rdi + 16]
add rsp, 24
.cfi_def_cfa_offset 8
jmp qword ptr [rax + 8]
.LBB193_13:
.cfi_def_cfa_offset 32
add rsp, 24
.cfi_def_cfa_offset 8
ret
.LBB193_14:
.cfi_def_cfa_offset 32
lea rdi, [rip + .Lanon.cd1a78ec98df3f76d56fd1466ac0099f.165]
lea rdx, [rip + .Lanon.cd1a78ec98df3f76d56fd1466ac0099f.166]
mov esi, 47
call qword ptr [rip + core::panicking::panic@GOTPCREL]
poll
.section .text.tokio::runtime::task::raw::poll::hd96dfd798919c755,"ax",@progbits
.p2align 4
.type tokio::runtime::task::raw::poll::hd96dfd798919c755,@function
tokio::runtime::task::raw::poll::hd96dfd798919c755:
.cfi_startproc
.cfi_personality 155, DW.ref.rust_eh_personality
- .cfi_lsda 27, .Lexception49
+ .cfi_lsda 27, .Lexception51
push rbp
.cfi_def_cfa_offset 16
push r15
.cfi_def_cfa_offset 24
push r14
.cfi_def_cfa_offset 32
push r12
.cfi_def_cfa_offset 40
push rbx
.cfi_def_cfa_offset 48
sub rsp, 512
.cfi_def_cfa_offset 560
.cfi_offset rbx, -48
.cfi_offset r12, -40
.cfi_offset r14, -32
.cfi_offset r15, -24
.cfi_offset rbp, -16
mov rbx, rdi
call qword ptr [rip + tokio::runtime::task::state::State::transition_to_running::h51d78340084c6090@GOTPCREL]
movzx eax, al
lea rcx, [rip + .LJTI77_0]
movsxd rax, dword ptr [rcx + 4*rax]
add rax, rcx
jmp rax
mov rax, qword ptr [rip + tokio::runtime::task::waker::WAKER_VTABLE::h0490a160f2a7ec56@GOTPCREL]
mov qword ptr [rsp], rax
mov qword ptr [rsp + 8], rbx
lea r14, [rbx + 32]
mov rax, rsp
mov qword ptr [rsp + 24], rax
mov qword ptr [rsp + 32], 0
mov qword ptr [rsp + 16], rax
cmp dword ptr [rbx + 56], 0
jne .LBB77_14
mov rdi, qword ptr [rbx + 40]
call qword ptr [rip + tokio::runtime::task::core::TaskIdGuard::enter::h31387a37d88abbe0@GOTPCREL]
- lea rdi, [rbx + 64]
mov qword ptr [rsp + 48], rax
+ mov r12, qword ptr [rbx + 40]
+ mov rax, qword ptr [rip + __usdt_sema_tokio_task__poll__start@GOTPCREL]
+ cmp word ptr [rax], 0
+ je .LBB77_5
+ mov rdi, r12
+ call qword ptr [rip + tokio::util::usdt::usdt_impl::task_poll_start::probe_inner::hda61800a23fcc40d@GOTPCREL]
+.LBB77_5:
+ lea rdi, [rbx + 64]
lea rsi, [rsp + 16]
call simple_echo_tcp::main::{{closure}}::h5671ec2a34b607f8
mov ebp, eax
+ mov rax, qword ptr [rip + __usdt_sema_tokio_task__poll__end@GOTPCREL]
+ cmp word ptr [rax], 0
+ je .LBB77_8
+ mov rdi, r12
+ call qword ptr [rip + tokio::util::usdt::usdt_impl::task_poll_end::probe_inner::h476c2bd4bfb2eb31@GOTPCREL]
+.LBB77_8:
lea rdi, [rsp + 48]
call qword ptr [rip + <tokio::runtime::task::core::TaskIdGuard as core::ops::drop::Drop>::drop::hab02cca3e87fef69@GOTPCREL]
test bpl, bpl
je .LBB77_10
mov rdi, rbx
call qword ptr [rip + tokio::runtime::task::state::State::transition_to_idle::ha74e5e89ce88109b@GOTPCREL]
movzx eax, al
lea rcx, [rip + .LJTI77_1]
movsxd rax, dword ptr [rcx + 4*rax]
add rax, rcx
jmp rax
mov rdi, r14
mov rsi, rbx
call qword ptr [rip + tokio::runtime::scheduler::multi_thread::handle::<impl tokio::runtime::task::Schedule for alloc::sync::Arc<tokio::runtime::scheduler::multi_thread::handle::Handle>>::yield_now::hf966c4e666337dba@GOTPCREL]
mov rdi, rbx
call qword ptr [rip + tokio::runtime::task::state::State::ref_dec::h3033e08956b2d202@GOTPCREL]
test al, al
je .LBB77_55
mov rdi, rbx
call core::ptr::drop_in_place<tokio::runtime::task::core::Cell<simple_echo_tcp::main::{{closure}},alloc::sync::Arc<tokio::runtime::scheduler::multi_thread::handle::Handle>>>::h8c2b259dda47c6ee
jmp .LBB77_54
mov rdi, rbx
call core::ptr::drop_in_place<tokio::runtime::task::core::Cell<simple_echo_tcp::main::{{closure}},alloc::sync::Arc<tokio::runtime::scheduler::multi_thread::handle::Handle>>>::h8c2b259dda47c6ee
.LBB77_54:
mov esi, 384
mov edx, 128
mov rdi, rbx
call qword ptr [rip + __rustc[de0091b922c53d7e]::__rust_dealloc@GOTPCREL]
jmp .LBB77_55
lea r14, [rbx + 32]
+ mov rax, qword ptr [rip + __usdt_sema_tokio_task__terminate@GOTPCREL]
+ cmp word ptr [rax], 0
+ je .LBB77_38
+ mov rdi, qword ptr [rbx + 40]
+ mov esi, 1
+ call qword ptr [rip + tokio::util::usdt::usdt_impl::task_terminate::probe_inner::h7db442807ccea7e2@GOTPCREL]
+.LBB77_38:
mov dword ptr [rsp + 280], 2
lea rsi, [rsp + 280]
mov rdi, r14
call tokio::runtime::task::core::Core<T,S>::set_stage::h6c06c0277547f365
.LBB77_39:
xor eax, eax
.LBB77_41:
mov rcx, qword ptr [rbx + 40]
mov qword ptr [rsp + 56], rcx
mov qword ptr [rsp + 64], rax
mov qword ptr [rsp + 72], rdx
mov dword ptr [rsp + 48], 1
lea rsi, [rsp + 48]
mov rdi, r14
call tokio::runtime::task::core::Core<T,S>::set_stage::h6c06c0277547f365
jmp .LBB77_42
.LBB77_10:
mov rax, qword ptr [rip + __usdt_sema_tokio_task__terminate@GOTPCREL]
cmp word ptr [rax], 0
je .LBB77_12
mov rdi, qword ptr [rbx + 40]
xor esi, esi
call qword ptr [rip + tokio::util::usdt::usdt_impl::task_terminate::probe_inner::h7db442807ccea7e2@GOTPCREL]
.LBB77_12:
mov dword ptr [rsp + 48], 2
lea rsi, [rsp + 48]
mov rdi, r14
call tokio::runtime::task::core::Core<T,S>::set_stage::h6c06c0277547f365
xor ecx, ecx
.LBB77_25:
mov qword ptr [rsp + 288], rcx
mov qword ptr [rsp + 296], rax
mov qword ptr [rsp + 304], rdx
mov dword ptr [rsp + 280], 1
lea rsi, [rsp + 280]
mov rdi, r14
call tokio::runtime::task::core::Core<T,S>::set_stage::h6c06c0277547f365
.LBB77_42:
mov rdi, rbx
call tokio::runtime::task::harness::Harness<T,S>::complete::hbafcc89c840d433c
.LBB77_55:
add rsp, 512
.cfi_def_cfa_offset 48
pop rbx
.cfi_def_cfa_offset 40
pop r12
.cfi_def_cfa_offset 32
pop r14
.cfi_def_cfa_offset 24
pop r15
.cfi_def_cfa_offset 16
pop rbp
.cfi_def_cfa_offset 8
ret
.cfi_def_cfa_offset 560
+ mov rax, qword ptr [rip + __usdt_sema_tokio_task__terminate@GOTPCREL]
+ cmp word ptr [rax], 0
+ je .LBB77_46
+ mov rdi, qword ptr [rbx + 40]
+ mov esi, 1
+ call qword ptr [rip + tokio::util::usdt::usdt_impl::task_terminate::probe_inner::h7db442807ccea7e2@GOTPCREL]
+.LBB77_46:
mov dword ptr [rsp + 280], 2
lea rsi, [rsp + 280]
mov rdi, r14
call tokio::runtime::task::core::Core<T,S>::set_stage::h6c06c0277547f365
jmp .LBB77_39
.LBB77_14:
lea rax, [rip + .Lanon.5efc90b22d074bac3f60c9ed09935ae4.75]
mov qword ptr [rsp + 48], rax
mov qword ptr [rsp + 56], 1
mov qword ptr [rsp + 64], 8
xorps xmm0, xmm0
movups xmmword ptr [rsp + 72], xmm0
lea rsi, [rip + .Lanon.5efc90b22d074bac3f60c9ed09935ae4.76]
lea rdi, [rsp + 48]
call qword ptr [rip + core::panicking::panic_fmt::h5138da2ef87be35b@GOTPCREL]
ud2
jmp .LBB77_51
mov rdi, rax
call qword ptr [rip + std::panicking::catch_unwind::cleanup::h80150e146981252a@GOTPCREL]
jmp .LBB77_41
call qword ptr [rip + core::panicking::panic_cannot_unwind::h2df093f6b1708ee2@GOTPCREL]
mov rdi, rax
call qword ptr [rip + std::panicking::catch_unwind::cleanup::h80150e146981252a@GOTPCREL]
mov r15, rax
test rax, rax
je .LBB77_42
mov r12, rdx
mov rax, qword ptr [rdx]
test rax, rax
je .LBB77_30
mov rdi, r15
call rax
.LBB77_30:
mov rsi, qword ptr [r12 + 8]
test rsi, rsi
je .LBB77_42
mov rdx, qword ptr [r12 + 16]
mov rdi, r15
call qword ptr [rip + __rustc[de0091b922c53d7e]::__rust_dealloc@GOTPCREL]
jmp .LBB77_42
mov r14, rax
mov rsi, qword ptr [r12 + 8]
test rsi, rsi
je .LBB77_35
mov rdx, qword ptr [r12 + 16]
mov rdi, r15
jmp .LBB77_34
call qword ptr [rip + core::panicking::panic_cannot_unwind::h2df093f6b1708ee2@GOTPCREL]
mov r15, rax
- lea rdi, [rsp + 48]
- call qword ptr [rip + <tokio::runtime::task::core::TaskIdGuard as core::ops::drop::Drop>::drop::hab02cca3e87fef69@GOTPCREL]
+ mov rax, qword ptr [rip + __usdt_sema_tokio_task__poll__end@GOTPCREL]
+ cmp word ptr [rax], 0
+ je .LBB77_19
+ mov rdi, r12
+ call qword ptr [rip + tokio::util::usdt::usdt_impl::task_poll_end::probe_inner::h476c2bd4bfb2eb31@GOTPCREL]
jmp .LBB77_19
- call qword ptr [rip + core::panicking::panic_in_cleanup::h8f68387bb6cbbf54@GOTPCREL]
mov rdi, rax
call qword ptr [rip + std::panicking::catch_unwind::cleanup::h80150e146981252a@GOTPCREL]
jmp .LBB77_41
call qword ptr [rip + core::panicking::panic_cannot_unwind::h2df093f6b1708ee2@GOTPCREL]
.LBB77_51:
mov r14, rax
mov esi, 384
mov edx, 128
mov rdi, rbx
.LBB77_34:
call qword ptr [rip + __rustc[de0091b922c53d7e]::__rust_dealloc@GOTPCREL]
.LBB77_35:
mov rdi, r14
call _Unwind_Resume@PLT
mov r15, rax
.LBB77_19:
- mov dword ptr [rsp + 280], 2
- lea rsi, [rsp + 280]
+ lea rdi, [rsp + 48]
+ call qword ptr [rip + <tokio::runtime::task::core::TaskIdGuard as core::ops::drop::Drop>::drop::hab02cca3e87fef69@GOTPCREL]
+ jmp .LBB77_22
+ call qword ptr [rip + core::panicking::panic_in_cleanup::h8f68387bb6cbbf54@GOTPCREL]
+ mov r15, rax
+.LBB77_22:
mov rdi, r14
- call tokio::runtime::task::core::Core<T,S>::set_stage::h6c06c0277547f365
+ call core::ptr::drop_in_place<tokio::runtime::task::harness::poll_future::{{closure}}::Guard<simple_echo_tcp::main::{{closure}},alloc::sync::Arc<tokio::runtime::scheduler::current_thread::Handle>>>::he33e022ff410fd70
mov rdi, r15
call qword ptr [rip + std::panicking::catch_unwind::cleanup::h80150e146981252a@GOTPCREL]
mov rcx, qword ptr [rbx + 40]
jmp .LBB77_25
call qword ptr [rip + core::panicking::panic_cannot_unwind::h2df093f6b1708ee2@GOTPCREL]
call qword ptr [rip + core::panicking::panic_in_cleanup::h8f68387bb6cbbf54@GOTPCREL]
The docs found here say that:
But the assembly you shared contains relocations such as
I think this is a necessary part of how I'm currently avoiding over-monomorphisation. Each probe can only have one semaphore, so every probe call site that checks the semaphore needs a relocation for that global. I'd like to figure out why the monomorphisation is causing issues with the linker, because then we could eliminate most of the semaphores entirely. I also played around with moving the poll probes higher up the stack, where the runtime is still polymorphic, but I wasn't too happy with it. Maybe I can rework or reconsider it.
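The semaphore check visible in the poll assembly above (load the global, compare with zero, skip the call) corresponds roughly to the following shape. This is a sketch under assumptions: the statics are hypothetical stand-ins for symbols such as __usdt_sema_tokio_task__poll__start, an atomic is used here for portability where the real semaphore is a plain global written by the tracer, and probe_inner only records its argument instead of firing a probe.

```rust
use std::sync::atomic::{AtomicU16, AtomicU64, Ordering};

// Hypothetical per-probe semaphore; a tracer raises it while attached.
pub static POLL_START_SEMA: AtomicU16 = AtomicU16::new(0);

// Records the last task id seen by the slow path, for demonstration only.
pub static LAST_TASK_ID: AtomicU64 = AtomicU64::new(0);

// Slow path: kept out of line so the hot path stays a load, compare and branch.
#[cold]
#[inline(never)]
fn probe_inner(task_id: u64) {
    LAST_TASK_ID.store(task_id, Ordering::Relaxed);
}

// Fast path: skips the call entirely when no tracer is attached.
#[inline(always)]
pub fn task_poll_start(task_id: u64) {
    if POLL_START_SEMA.load(Ordering::Relaxed) != 0 {
        probe_inner(task_id);
    }
}
```

Because the semaphore is one global per probe, every inlined copy of task_poll_start has to reference the same symbol, which is where the GOT-relative loads in the listing come from.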
Motivation
As discussed on Discord:
Solution
Using Userspace Statically Defined Tracing (USDT) we expose lightweight probes that can be attached to at runtime with tools like bpftrace or dtrace. This is inspired by https://github.com/oxidecomputer/usdt.
The new functionality is behind a new unstable feature flag. Currently it only exposes some basic task events and not yet any resource events.
See USDT in the wild: